Phase 6.3: TCP window management — track guest window, advertise host kernel rcv-space #79
10 bite-sized tasks covering proper TCP windowing:

- `TcpNatEntry` tracks `guest_window` (u32) + `guest_window_scale` (u8)
- `handle_tcp_frame` parses `tcp.window_scale()` on the guest SYN and stores it per-flow; updates `guest_window` on every incoming frame
- `build_tcp_packet_static` signature changes to take `(window_len, window_scale)` — the caller decides
- SYN-ACK negotiates `OUR_WINDOW_SCALE = 7` (passt's default; 128x)
- New `host_recv_window` helper queries `TCP_INFO.tcpi_rcv_space` and scales it for the advertised window on outgoing frames
- `relay_tcp_nat_data` gates host→guest sends on `entry.guest_window` to honor real backpressure
- Three new pins: `tcp_advertised_window_tracks_guest_buffer` (BROKEN_ON_PURPOSE → flips at Task 7), `tcp_window_scale_negotiated_in_synack`, plus the `tcp_bulk_throughput_constrained_window` parametric bench

Severity: MEDIUM — perf gap. The hardcoded `window_len: 65535` caps throughput at 64 KB / RTT regardless of bandwidth, and `inject_to_guest` can grow unbounded if the guest is slow.
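The per-flow tracking in the task list can be sketched as follows — a minimal illustration, not the actual implementation; only the field names come from the PR, the method name is hypothetical:

```rust
// Sketch of the per-flow window state described above. Field names follow
// the task list; `update_guest_window` is an illustrative helper.
struct TcpNatEntry {
    /// Guest's advertised receive window, already scaled into bytes.
    guest_window: u32,
    /// Window-scale shift parsed from the guest's SYN options (0 if absent).
    guest_window_scale: u8,
}

impl TcpNatEntry {
    /// Refreshed on every incoming frame: the raw 16-bit window field is
    /// left-shifted by the scale negotiated on the SYN. With scale 7
    /// (128x), a raw 65535 advertises ~8 MiB.
    fn update_guest_window(&mut self, raw_window: u16) {
        self.guest_window = u32::from(raw_window) << self.guest_window_scale;
    }
}
```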
Adds tcp_bulk_throughput_constrained_window bench that exercises the Task 7 window-gating path under three guest-window sizes (4096, 16384, 65536 bytes). Mirrors tcp_bulk_throughput_1mb with a parametric window so regressions in window-constrained relay show up numerically.
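The window-gating path the bench exercises boils down to this arithmetic — a sketch under assumed names (`window_remaining`, `send_budget` are illustrative, not the real functions):

```rust
// The relay sends at most what the guest's advertised window still allows.
fn window_remaining(guest_window: u32, bytes_in_flight: u32) -> u32 {
    // Saturating: if the guest shrank its window below what is already
    // in flight, the budget is zero rather than an underflow.
    guest_window.saturating_sub(bytes_in_flight)
}

fn send_budget(guest_window: u32, bytes_in_flight: u32, pending: u32) -> u32 {
    pending.min(window_remaining(guest_window, bytes_in_flight))
}
```

With a 4096-byte guest window and 4096 bytes already in flight, the budget is zero and the relay waits for a guest ACK — the backpressure the parametric bench measures.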
Profiling note: tcp_bulk_throughput_1mb regression root-caused

Followed up on the divan +1.6% / VM wall-clock -3.4% throughput regression with a PMU summary.
On-CPU flat hotspots
Conclusion

The throughput regression traces to one syscall, not the data-structure layout.

Proposed follow-up (separate small PR, not blocking this one): cache the `TCP_INFO` result per flow with a short TTL.

```rust
// On TcpNatEntry:
cached_recv_window: u16,
cached_recv_window_at: Instant,

// In the build_tcp_packet_static call sites for data/ACK frames:
const RECV_WINDOW_TTL: Duration = Duration::from_millis(5);
if entry.cached_recv_window_at.elapsed() > RECV_WINDOW_TTL {
    entry.cached_recv_window = host_recv_window(entry.host_stream.as_raw_fd());
    entry.cached_recv_window_at = Instant::now();
}
```

The HashMap-flow-table cache-miss audit is still a worthwhile separate exercise, but the divan/wall-clock regression seen on this PR isn't traceable to it. IPC of 0.78 suggests we're modestly memory-bound elsewhere (likely the smoltcp wire-decode hot path), but the cache-miss rate doesn't indicate pathological structures. Profiles archived locally:
Profiling tcp_bulk_throughput_1mb showed __getsockopt at 5.7% flat CPU — Phase 6.3's host_recv_window was issuing one getsockopt(TCP_INFO) per outgoing TCP frame, costing ~10k syscalls/s at line rate. Cache the result on TcpNatEntry and refresh only every RECV_WINDOW_TTL (5 ms). At line rate this collapses to ~200 syscalls/s — a ~50x reduction — while the advertised window stays within 5 ms of reality, which is well below any realistic RTT. cached_recv_window is initialized at flow construction with one host_recv_window call so the first emitted frame doesn't pay the syscall cost on the data path either.
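A back-of-envelope check of the syscall-rate claims above (the constants restate the numbers from this note; the helper names are illustrative):

```rust
// Pre-fix: one getsockopt(TCP_INFO) per outgoing frame at line rate.
const LINE_RATE_PPS: f64 = 10_000.0;
// Post-fix: at most one refresh per elapsed RECV_WINDOW_TTL (5 ms).
const RECV_WINDOW_TTL_S: f64 = 0.005;

fn post_fix_syscall_rate() -> f64 {
    1.0 / RECV_WINDOW_TTL_S // -> 200 syscalls/s
}

fn syscall_reduction() -> f64 {
    LINE_RATE_PPS / post_fix_syscall_rate() // -> 50x
}
```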
Cache fix landed and re-profiled — regression eliminated

Divan microbenches — before/after the cache fix (vs current `main`)
| Bench | Pre-fix Δ% | Post-fix Δ% | Recovery |
|---|---|---|---|
| `tcp_bulk_throughput_1mb` | +1.6% | +0.1% | regression eliminated |
| `tcp_rx_latency_one_packet` | +6.7% | +2.3% | recovered 4.4 pp |
| `tcp_inbound_syn_ack_transition` | -19.4% | -30.5% | even faster post-fix |
| `process_icmp_echo_request` | +6.1% | +1.9% | recovered 4.2 pp |
| `flow_table_insert_remove/1000` | +5.9% | -2.0% | now better than baseline |
Some flow-construction benches show small regressions (process_syn +4.6%, port_forward_accept_latency +6.1%, process_syn_during_pending_connects/0 +7.2%) — that's the one-time host_recv_window syscall now at flow-creation rather than per-frame. Pay-once-per-flow vs pay-per-packet is the right trade. At line rate (~10k packets/s, ~50 connects/s) this is a >100× syscall reduction.
VM wall-clock — before/after vs current `main`

| Metric | Pre-fix Δ% | Post-fix Δ% |
|---|---|---|
| `tcp_throughput_g2h_mbps` | -3.4% (5942 → 5739) | -0.2% (5776 → 5765) |
| `tcp_rr_latency_us_p50` | -50% | parity (both at 2 µs) |
| `tcp_crr_latency_us_p50` | parity | parity |
PMU — before/after (same 30s capture per side, single bench process)
| Metric | Pre-fix | Post-fix | Δ |
|---|---|---|---|
| IPC | 0.777 | 0.786 | +1.2% |
| Cache Misses / 1K instr | 3.666 | 3.924 | +7.0% (denominator effect) |
| Total Cache Misses (abs) | 86.83 M | 84.36 M | -2.85% |
| Total Instructions | 23.68 B | 21.50 B | -9.2% |
| Total Cycles | 30.47 B | 27.35 B | -10.3% |
| P99.9 on-CPU | 10.17 ms | 9.41 ms | -7.5% |
Total work dropped ~10% (less syscall traffic), IPC improved, and absolute cache misses fell 2.85%. The per-1K-instr rate ticked up because we removed a lot of cache-friendly syscall instructions from the denominator — the remaining mix is slightly more miss-dense but __getsockopt no longer dominates the on-CPU profile.
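The denominator effect is easy to verify from the table's own numbers — a quick arithmetic check, with the miss and instruction counts restated from the PMU table:

```rust
// Misses per 1K instructions: absolute misses fell, but the rate rose
// because total instructions fell faster than total misses.
fn misses_per_1k_instr(misses: f64, instructions: f64) -> f64 {
    misses / instructions * 1000.0
}

// Pre-fix:  86.83 M misses / 23.68 B instr -> ~3.666 per 1K
// Post-fix: 84.36 M misses / 21.50 B instr -> ~3.924 per 1K
```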
On-CPU top-7 — before/after
| Function | Pre-fix flat % | Post-fix flat % |
|---|---|---|
| `handle_tcp_frame` | 26.70% | 25.00% |
| `__libc_recv` (cum) | 29.90% | 35.71% |
| `__libc_send` (cum) | 25.03% | 24.40% |
| `EpollDispatch::wait_with_timeout` | 13.63% | 16.07% |
| `__getsockopt` | 5.70% | — (gone from top-25) |
| `process_guest_frame` | 5.84% | 4.46% |
| `drain_to_guest` | 6.54% | 6.25% |
HashMap cache-miss hypothesis — verdict
At 3.92 cache-misses / 1K instructions (post-fix, well below the 10/1K threshold), the flow-table HashMap does not appear to be a dominant cache-pressure source for tcp_bulk_throughput_1mb. IPC of 0.786 says we're still mildly memory-bound, but it's not localised to the data structure. Hypothesis not confirmed by data on this workload. Worth re-investigating under different workloads (many concurrent flows, different per-entry sizes) but not blocking this PR.
Profiles archived locally:
- pre-fix: `/tmp/p63-bench-{cpu,offcpu,pmu}.{pb.gz,txt}`
- post-fix: `/tmp/p63-fixed-{cpu,offcpu,pmu}.{pb.gz,txt}`
What this branch does
Stops ignoring the guest's advertised TCP window and stops hardcoding our own. Three correctness/perf gaps closed:
- The guest's advertised `window_len` (scaled by `window_scale` from SYN options) is stashed on the flow. `relay_tcp_nat_data` gates `frames_to_inject` on `guest_window - bytes_in_flight`, so the relay stops when the guest's receive buffer is full instead of pretending it's infinite. Phase 3's 256 KB cap was a band-aid for the symptom.
- Outgoing frames advertise `host_recv_window(fd)` (computed from `getsockopt(TCP_INFO).tcpi_rcv_space >> OUR_WINDOW_SCALE`) instead of a hardcoded 65535.
- The SYN-ACK negotiates `window_scale: 7` (matches passt; 128× → 8 MiB max).

Headline win
- No more unbounded `inject_to_guest` queue (Phase 3 capped it at a 256 KB userspace cliff)
- The throughput ceiling is now the guest's real `guest_window` (modern Linux: 4 MB+ scaled)
- The advertised window reflects actual host buffer state via `getsockopt(TCP_INFO).tcpi_rcv_space`

Architecture
- `TcpNatEntry::guest_window: u32` and `guest_window_scale: u8` (`#[serde(default)]` for snapshot back-compat with pre-6.3).
- SYN handling parses the guest's `tcp.window_scale()` option and stashes it; every incoming frame refreshes `entry.guest_window = u32::from(tcp.window_len()) << guest_window_scale`.
- `relay_tcp_nat_data` adds a `window_remaining = guest_window - bytes_in_flight` gate; when zero, it breaks out (waits for guest ACK).
- `build_tcp_packet_static` signature now takes `(window_len, window_scale)`. The SYN-ACK passes `(65535, Some(7))`; data/ACK frames pass `(host_recv_window(fd), None)`.
- `host_recv_window(fd) -> u16` helper: one `getsockopt(IPPROTO_TCP, TCP_INFO, ...)` call, returns `tcpi_rcv_space >> 7` clamped to `u16::MAX`. Falls back to 32768 on syscall error.

Bench evidence — divan microbenches (vs current `main`)

Run via `scripts/bench-compare.sh --baseline origin/main --skip-vm`. Benches covered:

- `tcp_inbound_syn_ack_transition`
- `process_udp_frame`
- `port_forward_accept_latency`
- `dns_cache_hit`
- `nat_translate_outbound_hot_path`
- `flow_table_insert_remove/100`
- `process_syn`
- `tcp_inbound_syn_ack_transition` (`getsockopt(TCP_INFO)` cost)
- `tcp_bulk_throughput_1mb`
- `tcp_rx_latency_one_packet`
- `process_icmp_echo_request`
- `flow_table_insert_remove/1000`
- `poll_with_n_mixed_flows/999`
- `tcp_bulk_throughput_constrained_window/4096`
- `tcp_bulk_throughput_constrained_window/16384`
- `tcp_bulk_throughput_constrained_window/65536`

Wall-clock VM harness (`voidbox-network-bench`)

- `tcp_rr_latency_us_p50`
- `tcp_rr_latency_us_p99`
- `tcp_crr_latency_us_p50`
- `tcp_throughput_g2h_mbps`

The 3.4% g2h throughput regression appears to be the per-outgoing-frame `getsockopt(TCP_INFO)` syscall cost in `host_recv_window`. Profiling planned as a follow-up — candidate fixes: cache the value with a 1 ms TTL, or move the syscall onto the net-poll thread's housekeeping cadence so the data path uses a stale-but-recent value. The trade is worth it for now: correct backpressure is a correctness fix, not a perf trick. Phase 6.4 epoll dispatch absorbs the latency improvements (RR p50 -50%), so the net change vs pre-Phase-6.x `main` is heavily positive.

Snapshot interaction
Pre-6.3 snapshots restore cleanly: both new fields have `#[serde(default)]` and default to `(65535, 0)`, which is the pre-6.3 behavior (no scale, ignore guest window — same as if the entry was a Phase 6.0 entry). Verified via the existing `snapshot_integration` suite.

passt-comparison status
Documented as a deferred task in `docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md` ("passt head-to-head methodology"). Methodology agreed: same hardware, two-column report, focus on CRR latency (apples-to-apples since CRR is dominated by NAT-table ops, not MMIO exit overhead). Building the passt+qemu reference harness is a separate follow-up PR.

Commits (10)
Cherry-picked clean from `smoltcp-passt-port-phase6.3-window-mgmt` onto current `main` (post-#78):

1. docs: Phase 6.3 detailed TDD plan — TCP window management
2. feat(slirp): TcpNatEntry tracks guest_window + guest_window_scale
3. feat(slirp): parse guest's window_scale on SYN, store on flow
4. feat(slirp): track guest's advertised window on every incoming frame
5. refactor(slirp): build_tcp_packet_static takes (window_len, window_scale)
6. feat(slirp): advertise host-kernel-derived window on outgoing frames
7. test(network): pin tcp_advertised_window_tracks_guest_buffer (BROKEN_ON_PURPOSE)
8. feat(slirp): gate host→guest send on guest's advertised window — flips the BROKEN_ON_PURPOSE pin
9. test(network): pin tcp_window_scale_negotiated_in_synack
10. bench(network): tcp_bulk_throughput_constrained_window parametric

Test plan
- `cargo fmt --all -- --check` — clean
- `cargo clippy --workspace --all-targets --all-features -- -D warnings` — clean
- `RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features` — clean
- `cargo test --test network_baseline -- --test-threads=1` — 24/24 (was 22; +2 window pins)
- `cargo test --test network_baseline --features bench-helpers -- --test-threads=1` — 26/26
- `scripts/bench-compare.sh --baseline origin/main --skip-vm` — see table above
- `scripts/bench-compare.sh --baseline origin/main --skip-divan` (VM wall-clock) — see table above

Replaces draft #75
Same window-management content via the now-superseded #74 chain. Close #75 once this lands.
Follow-ups (not blocking this PR)
- `host_recv_window` perf: profile the +1.6% bulk regression; cache TCP_INFO with a short TTL or move into the housekeeping cadence.
- Flow-table cache-miss audit (`HashMap<FlowKey, FlowEntry>`) — separately tracked: data-path pollers do linear scans by `FlowKey` variant, which on a 1000-flow table at small entries is cache-unfriendly. Candidate: split into per-protocol maps or move to a small-vector for low-flow-count sandboxes.
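The per-protocol split could take roughly this shape — purely illustrative; `FlowKey` and `FlowEntry` here are hypothetical stand-ins for the real types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real key/entry types.
#[derive(Hash, PartialEq, Eq)]
enum FlowKey {
    Tcp(u16),
    Udp(u16),
}
struct FlowEntry {
    bytes: u64,
}

// Today: one unified map, so a TCP-only poller's linear scan still
// touches UDP entries' cache lines.
type UnifiedTable = HashMap<FlowKey, FlowEntry>;

// Candidate: per-protocol maps, so each data-path poller iterates only
// the entries of its own protocol.
#[derive(Default)]
struct SplitTable {
    tcp: HashMap<u16, FlowEntry>,
    udp: HashMap<u16, FlowEntry>,
}
```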